9 research outputs found

    Malware Classification based on Call Graph Clustering

    Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. Representing malware samples as call graphs abstracts away certain variations and enables the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs with one another, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes. Comment: This research has been supported by TEKES - the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/0
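The clustering step described in this abstract can be illustrated with a minimal sketch of DBSCAN over a precomputed pairwise distance matrix. This is not the paper's implementation: the function name `dbscan`, the parameters `eps` and `min_pts`, and the toy distance values are all assumptions made for illustration. In the paper's setting the distances would come from the approximate graph edit distance between call graphs.

```python
def dbscan(dist, eps, min_pts):
    """Cluster items from a symmetric pairwise distance matrix.

    Returns one label per item; -1 marks noise.
    `dist`, `eps` and `min_pts` are illustrative names, not the paper's.
    """
    n = len(dist)
    labels = [None] * n          # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        # The eps-neighbourhood includes the point itself (distance 0).
        neighbours = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbours) < min_pts:
            labels[i] = -1       # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = [k for k in range(n) if dist[j][k] <= eps]
            if len(nb_j) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb_j)
    return labels


# Toy data: two tight groups and one isolated sample (hypothetical distances).
groups = [0, 0, 0, 1, 1, 2]
dist = [[0.0 if i == j else (0.1 if groups[i] == groups[j] else 0.9)
         for j in range(6)] for i in range(6)]
print(dbscan(dist, eps=0.2, min_pts=2))  # prints [0, 0, 0, 1, 1, -1]
```

Because DBSCAN needs only pairwise distances and not coordinates, it suits graph data where a distance such as the graph edit distance is available but no vector embedding is.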

    Analyzing and comparing arrangements of temporal intervals

    This thesis focuses on comparing and analysing arrangements of temporal intervals. Such arrangements are sets of concurrent events that are not instantaneous, but are characterized by duration. We study two major problems. The first problem is comparing arrangements of event-intervals and computing the distance between them. To the best of our knowledge, we are the first to formally define the problem. Furthermore, we present three polynomial-time distance functions which we study and benchmark through rigorous experimentation. The proposed methods were tested on three datasets: American Sign Language utterances, sensor data and Hepatitis patient data. In addition, we provide a linear-time lower bound for one of the distance measures. The distance measures can be applied to event-interval sequences, too. In this case, neither the event-interval durations nor the absolute time values are considered. The second problem which we study is finding the longest common sub-pattern (LCSP) of arrangements of temporal intervals. We prove hardness results for the problem and devise an exact algorithm for computing the LCSP of pairs of arrangements.

    Advances in Analysing Temporal Data

    Modern technical capabilities and systems have resulted in an abundance of data. A significant portion of all that data is of temporal nature. Hence, it becomes imperative to design effective and efficient algorithms and solutions that enable searching and analysing large databases of temporal data. This thesis contains several contributions related to the broad scientific field of temporal-data analysis. First, we present a distance function for pairs of event-interval sequences, together with proofs of important properties, such as that the function is a metric, and a lower-bounding function. An embedding-based indexing method is proposed for searching through large databases of event-interval sequences under this distance function. Second, we study the problem of subsequence search for event-interval sequences. This includes hardness results, an exact worst-case exponential-time algorithm, two upper bounds and a scheme for approximation algorithms. In addition, an equivalence is established between graphs and event-interval sequences. This equivalence makes it possible to derive hardness results for several problems of event-interval sequences. Most importantly, it raises the question of which techniques, results, and methods from each of the fields of graph mining and temporal data mining can be applied to the other to advance the current state of the art. Third, for the problem of subsequence search, we propose an indexing method based on decomposing event-interval sequences into 2-interval patterns. The proposed indexing method is benchmarked against other approaches. In addition, we examine different variations of the problem and propose exact algorithms for solving them. Fourth, we describe a complete system that enables the clustering of a stream of graphs. The graphs are clustered into groups based on their distances, via approximating the graph edit distance. The proposed clustering algorithm achieves a good clustering with few graph comparisons. The effectiveness and usefulness of the system is demonstrated by clustering function call-graphs of binary executable files for the purpose of malware detection. Finally, we solve the problem of summarising temporal networks. We assume that networks operate in certain modes and that the total sequence of interactions can be modelled as a series of transitions between these modes. We prove hardness results and provide heuristic procedures for finding approximate solutions. We demonstrate the quality of our methods via benchmarking and performing case studies on datasets taken from sports and social networks.

    Temporal Networks - Football, Handball, Basketball

    This is a collection of datasets used for research in the field of Temporal Networks. We first used these datasets in the following publication: O.Kostakis, N.Tatti, A.Gionis, "Discovering recurrent activity in temporal networks", Data Mining and Knowledge Discovery, Special Issue in Sports Analytics, 2016. In summary, this collection contains three different datasets. The first is data about all matches in the 1996-'97 English Premier League. The second dataset contains a temporal network corresponding to team-passing activity of a handball team. Finally, the third dataset contains play-by-play information for 1101 basketball matches of the 2014-'15 NBA season. Within each folder, you will find a separate README file for each dataset. Disclaimer: We do not claim to have produced or own the data. We do not claim the correctness of the data. We provide the data only for reasons related to research, including but not limited to research reproducibility.

    Improved call graph comparison using simulated annealing

    The number of suspicious binary executables submitted to Anti-Virus (AV) companies is on the order of tens of thousands per day. Current hash-based signature methods are easy to deceive and are inefficient at identifying known malware that has undergone minor changes. Examining malware executables through their call-graph view is a suitable approach for overcoming the weaknesses of hash-based signatures. Unfortunately, many operations on graphs are of high computational complexity. One of these is computing the Graph Edit Distance (GED) between pairs of graphs, which seems a natural choice for static comparison of malware. We demonstrate how Simulated Annealing can be used to approximate the graph edit distance of call graphs, while outperforming previous approaches in both execution time and solution quality. Additionally, we experiment with opcode mnemonic vectors to reduce the problem size and examine how Simulated Annealing is affected.
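As a rough illustration of the idea in this abstract, the sketch below uses simulated annealing to search for a vertex mapping between two small directed graphs that minimises the number of mismatched edges. It is a deliberately simplified stand-in, not the authors' method: it assumes both graphs have the same number of vertices, it counts only edge mismatches (no vertex insertions, deletions or relabelling costs), and the names `edge_mismatch` and `anneal`, the swap move, and the geometric cooling schedule with its parameter values are all assumptions.

```python
import math
import random


def edge_mismatch(edges_a, edges_b, mapping):
    """Edges of A that miss B under the mapping, plus edges of B left uncovered."""
    mapped = {(mapping[u], mapping[v]) for u, v in edges_a}
    return len(mapped ^ edges_b)


def anneal(edges_a, edges_b, n, steps=5000, t0=5.0, cooling=0.999, seed=0):
    """Return the lowest edge-mismatch cost found over vertex bijections.

    Hypothetical parameters: `t0` is the start temperature and `cooling`
    the geometric decay factor applied after every step.
    """
    rng = random.Random(seed)
    mapping = list(range(n))               # start from the identity mapping
    cost = edge_mismatch(edges_a, edges_b, mapping)
    best = cost
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)     # propose swapping two images
        mapping[i], mapping[j] = mapping[j], mapping[i]
        new_cost = edge_mismatch(edges_a, edges_b, mapping)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            best = min(best, cost)
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
        t *= cooling
    return best


# Two copies of a directed path on four vertices, the second one relabelled.
path_a = {(0, 1), (1, 2), (2, 3)}
path_b = {(3, 2), (2, 1), (1, 0)}
print(anneal(path_a, path_b, n=4))
```

Accepting some cost-increasing swaps while the temperature is high lets the search leave mappings that no single swap can improve, which a purely greedy matching cannot do.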